Abstract

This thesis presents an algorithm for determining the connectivity of a set of N
rectangles in the plane, a problem central to avoiding aliasing in VLSI design rule
checkers. Previous algorithms for this problem either worked slowly with a small
amount of primary memory space, or worked quickly but used more space. The algorithm
presented here, based upon a technique called scanning, operates in O(N lg N)
time in the worst case. This matches the running time of the best known sequential
algorithm for this problem. Because we use a machine model that explicitly
incorporates secondary memory, the new connected components algorithm avoids the
unexpected disk thrashing that leads to lower performance. The algorithm uses O(W)
primary memory space, where W, the scan width, is the maximum number of rectangles
crossed by any vertical cut. It requires no more than O(N) transfers between primary
and secondary memory.

When a vertical line passes through a set of rectangles, those rectangles cut by the
line form a set of line segments. The key to the development of space-efficient algorithms
using a two-layer memory model is that appropriate manipulations of these segments
alone can solve more complicated problems such as the connected components problem.
This thesis introduces interval trees, a simple, sparse data structure for storing a set
of k line segments. With this data structure, a variation on a balanced search tree, one
can perform each of the following operations in O(lg k) time in the worst case: 1) insert
a new segment, 2) delete a segment, and 3) given a test interval, return a segment
which intersects that test interval or return nil if there is no such segment. This data
structure is used in the new connected components algorithm. It can also be used to
improve other existing algorithms for computational geometry problems.
Introduction

For  a  VLSI  design  to  be  reliably  produced  as  a  working  chip,  various  features  on 
the  chip  must  be  separated  by  minimum  distances  to  ensure  the  proper  operation  of 
transistors  and  interconnections.  The  design  rule  checker  program  verifies  that  these 
and  other  geometric  constraints  are  satisfied  and  signals  an  error  if  it  finds  two  features 
that  violate  the  design  rules.  For  a  chip  composed  of  millions  of  rectangles,  design  rule 
checking  is  a  time-consuming  process  which  cannot  be  done  entirely  within  the  primary 
memory  of  many  computers. 

This thesis presents an efficient algorithm for finding the connected components of
rectangles in the plane using a machine model which incorporates the secondary disk
memory where the VLSI design is stored. By running this algorithm simultaneously on
each layer of a VLSI chip design, a design rule checker can determine which features of
a chip design are electrically equivalent, i.e., are effectively part of the same wire. The
determination of electrical equivalence allows the design rule checker to avoid reporting
the many aliasing errors which occur when two electrically equivalent features are
mistaken for electrically distinct features. For example, two wires might be too close
together, but if they are actually the same wire, it does not matter.

Many VLSI design systems use rectilinearly oriented rectangles to represent the
design features. Two rectangles are electrically equivalent if they are connected by a
path of intersecting rectangles. The connected components problem is to label each
rectangle in a design such that two rectangles have the same label if and only if they
are in the same connected component. The set of rectangles in Figure 1, for instance,
has three connected components: {A,B,D,E,G}, {C,F}, and {H}.

The connected components of N rectangles in the plane can be determined in
O(N lg N) time by an algorithm due to Guibas and Saxe [4]. Their algorithm uses the
technique of scanning, introduced by Shamos and Hoey Its], which assumes that the
vertical edges of rectangles are initially sorted by x-coordinate. Scanning algorithms
work by sweeping a scanline over a set of geometric objects in the plane and then
working primarily with the objects crossed by the scanline. In the Guibas-Saxe algorithm,
the scanline is a vertical line that sweeps from left to right over the rectangles.
(Imai and Asano [5] also have an O(N lg N) connected components algorithm for the
primary memory model which is not based on scanning.)

The O(N lg N) running time that Guibas and Saxe achieve is remarkable in that
there may be on the order of N^2 rectangle intersections. Unfortunately, the Guibas-Saxe
algorithm is designed to run entirely within primary memory, and it may cause
disk thrashing for a large VLSI chip.

In  this  thesis,  we  abandon  the  simple  primary  memory  model,  and  instead  use  a 
machine  model  which  includes  a  secondary  disk  memory  as  well  as  primary  memory. 
The  configuration  is  shown  in  Figure  2.  We  assume  that  the  primary  memory  is  a 
fast,  random-access  memory  of  limited  size.  The  set  of  rectangles  is  kept  in  a  file 
in secondary disk storage. Accesses to the file are presumed to be sequential, either
forward or backward. More general random accesses to disk blocks are unnecessary
for  our  algorithm. 




This model is used by Szymanski and Van Wyk [10] for a connected components
algorithm, a special case of their algorithm for connectivity analysis of more general
regions. Their algorithm is more suitable for large rectangle databases because it uses
less primary memory than the Guibas-Saxe algorithm and has locality of reference
for secondary memory. The amount of primary memory space used by the algorithm
is O(W), where W, the scan width, is the largest number of rectangles cut by any
scanline. In practice, Szymanski and Van Wyk comment, the size of W is about
O(√N). Unfortunately, their algorithm is based on rectangle intersections, and the
running time can be as large as O(NW).

This thesis presents a connected components algorithm that combines and optimizes
the Szymanski and Van Wyk and the Guibas and Saxe algorithms. It uses
O(W) (primary memory) space and runs in O(N lg N) time in the worst case.

The  algorithm  consists  of  a  two-pass  scan  over  the  set  of  rectangles.  Most  of  the 
work  is  done  in  the  first,  forward  scan.  A  backward  scan  is  then  used  to  produce  the 
labeling  of  rectangles  such  that  two  rectangles  have  the  same  label  if  and  only  if  they 
are in the same connected component. The algorithm maintains four data structures
of size O(W) during its forward scan.

The  first  chapter  of  this  thesis  presents  the  connected  components  algorithm 
and  its  analysis.  The  second  chapter  describes  a  data  structure  and  algorithms  for 
the  implementation  of  the  scan  set,  one  of  the  data  structures  used  in  the  forward 
scan of the connected components algorithm. Some sort of scan set appears in every
scanning-based algorithm for solving problems with rectangles. In particular, the
new data structure, interval trees, could be used by Guibas and Saxe to simplify the
implementation of their algorithm. The introduction to chapter 2 discusses some of
the previous implementations of scan sets.


Figure 1: A set of rectangles with connected components {A,B,D,E,G}, {C,F},
and {H}. On the right is shown a scan set at the time rectangle E enters. Only
active rectangles (those crossed by the scanline) have an interval in the scan set. The
interval for E will be entered in the scan set S after all processing for its enter event
is complete.


Figure 2: The computer model includes a secondary disk memory as well as
primary memory. The connected components algorithm, which assumes the rectangle
database is on disk, uses O(N) references to sequential files, O(W) primary memory
space, and O(N lg N) CPU time.


1.  The  Connected  Components  Algorithm 

This chapter presents the connected components algorithm and its analysis. Sections
1, 2, 3, and 4 describe the four data structures used during the forward scan. Section
5 gives the algorithm, Section 6 proves its correctness, Section 7 analyses the time and
space requirements of the algorithm, and Section 8 offers some remarks.

1.1.  Rectangle  set 

In scanning algorithms an event is a geometric phenomenon that causes some
computation at the time when it occurs. There are two types of events for a left-to-right
scan: a start event when the scanline crosses the left boundary of a rectangle
(the rectangle becomes active, or enters) and an end event when the scanline crosses
the right boundary of a rectangle (the rectangle becomes inactive, or leaves). Each
rectangle has an associated start event and end event.

There are two technical issues to be resolved during scanning. The first is the
management of active rectangles, and the second is the sorting of events.

The rectangle set R is a dynamic set that contains the active rectangles at any
point during the scan. The problem is that the association between start and end
events must be maintained in small primary memory space. We assume that each
rectangle in the disk file has a unique identification number. When a rectangle enters
primary memory, it is stored in the set R with the identification number as a key. The
rectangle set can be maintained as a balanced search tree, using O(W) space. Each
insertion, deletion, or search takes O(lg W) = O(lg N) time.

The scanning part of our algorithm assumes the events are sorted by x-coordinate.
If not, the events must be sorted. This takes O(N lg N) time in the worst case, but
most computer systems do have a fast disk sort. Much of the time, we can do much
better because many VLSI databases already keep rectangles sorted by left edge.

Given a file sorted by left edge alone, we can sort it into start and end events
in O(N lg N) time and O(W) space using an idea due to Szymanski and Van Wyk
[10]. The idea is to keep a priority queue, such as a heap [1, p. 117-152], in primary
memory. During the operation of the algorithm, the priority queue holds at most W + 1
rectangles sorted by right endpoint. When a new rectangle is read in, its right endpoint
is stored in the priority queue. Then the priority queue is emptied of all rectangles
with right endpoint smaller than the left endpoint of the new rectangle. For each of
these rectangles, the right endpoint is written out in order as an end event. Then the
left endpoint of the new rectangle is written out as a start event. Thus, without loss
of generality, we can assume the start and end events are presorted.
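
The priority-queue idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the function name to_events and the tuple layout (rect_id, x_left, x_right) are hypothetical, the input is an in-memory list rather than a disk file, and the queue is flushed once the input is exhausted so that every rectangle receives its end event.

```python
import heapq


def to_events(rects):
    """Turn rectangles sorted by left edge into a merged stream of events.

    rects: iterable of (rect_id, x_left, x_right), sorted by x_left
           (hypothetical record layout, with x_left <= x_right).
    Yields ("start" | "end", x, rect_id) in scan order.
    """
    heap = []  # min-heap keyed on right endpoint; holds the pending end events
    for rect_id, x_left, x_right in rects:
        # Store the new rectangle's right endpoint first, as in the text.
        heapq.heappush(heap, (x_right, rect_id))
        # Emit end events for rectangles that finish strictly before x_left.
        while heap and heap[0][0] < x_left:
            x_end, done_id = heapq.heappop(heap)
            yield ("end", x_end, done_id)
        yield ("start", x_left, rect_id)
    # Assumption: at end of input, flush the remaining end events in order.
    while heap:
        x_end, done_id = heapq.heappop(heap)
        yield ("end", x_end, done_id)
```

Because only rectangles with a right endpoint strictly smaller than the new left endpoint are removed, a rectangle that ends exactly where another begins is still active when the newcomer starts.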

There  are  other,  more  mundane  data  management  issues  to  be  faced  in  the  course 
of  programming  the  connected  components  algorithm  described  here.  Most  of  these 
can  be  resolved  using  simple  pointer  associations,  but  the  more  complicated  will  be 
addressed  directly  in  the  sections  to  come. 

1.2.  Scan  set 

We now turn our attention to the data structure that maintains the scanline for
the connected components algorithm. At any point during the forward scan, the active
rectangles can be represented as a set of vertical intervals, i.e., intervals in y. For
example, Figure 1 shows the intervals of the active rectangles at the time rectangle
E enters. The scan set S maintains the dynamic set of intervals that represents the
active rectangles.

The  scan  set  allows  the  connected  components  algorithm  to  determine  rectangle 
intersections  easily.  Two  rectangles  intersect  if  and  only  if  there  is  a  scanline  that 
crosses  both  rectangles,  and  their  intervals  overlap  in  the  scan  set  corresponding  to 
the scanline. This technique for determining rectangle intersections is well known and
is  used  in  previous  scan-based  algorithms  for  determining  rectangle  intersections  or 
connected  components  [3],  [4],  [10]. 

To be precise, a scan set S supports the following operations:

S-INSERT!(A)

Add rectangle A to the scan set.

S-DELETE!(A)

Remove rectangle A from the scan set.

S-FIND(I)

Returns a rectangle in the scan set S that overlaps interval I in some way, and
NIL if no rectangles overlap I.

The number of rectangles stored in S at any given time during a scan is at most
the scan width W. We can implement each of the three operations in time O(lg W)
using space O(W) with interval trees. This data structure is described in chapter 2.
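
To make the interface concrete, here is a deliberately naive Python stand-in for the scan set. It is not an interval tree: S-FIND below takes time linear in the number of stored intervals rather than O(lg W). The class and method names are hypothetical, and the sketch assumes closed intervals, so two intervals that merely touch are reported as overlapping.

```python
class NaiveScanSet:
    """Linear-time stand-in for the scan set S (illustration only)."""

    def __init__(self):
        self._intervals = {}  # rect_id -> (y_bot, y_top)

    def insert(self, rect_id, y_bot, y_top):
        """S-INSERT!(A): record the vertical interval of an active rectangle."""
        self._intervals[rect_id] = (y_bot, y_top)

    def delete(self, rect_id):
        """S-DELETE!(A): forget a rectangle that has become inactive."""
        del self._intervals[rect_id]

    def find(self, y_lo, y_hi):
        """S-FIND(I): return some rectangle overlapping [y_lo, y_hi], or None."""
        for rect_id, (y_bot, y_top) in self._intervals.items():
            # Two closed intervals overlap iff each starts before the other ends.
            if y_bot <= y_hi and y_lo <= y_top:
                return rect_id
        return None
```

The value None plays the role of NIL. Any structure offering these three operations could be substituted for the sketch without changing the code that calls it.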

1.3.  Component  set 

During the forward scan, the connected components algorithm maintains a component
set Q that reflects our current knowledge of the connectivity of the active rectangles.
Each component is designated by a color, which for convenience is represented
as an integer.1

The  rectangle  colorings  within  the  component  set  Q  may  change  with  a  start  event. 
If  a  new  rectangle  connects  two  previously  unconnected  components,  we  merge  them 
within  the  component  set  Q  by  recoloring  active  rectangles  in  the  smaller  of  the  two. 

The  component  set  Q  supports  the  following  operations: 

COLOR!(A) 

Assigns  rectangle  A  a  new  (unused)  color. 

UNCOLOR!(A) 

Dissociates  rectangle  A  from  others  of  its  color.  If  A  is  the  last  of  its  color,  the 
color  is  destroyed  (made  available  for  reuse). 

COLOR(A) 

Returns  A’s  color. 

REPRESENTATIVE(q)

Returns  any  rectangle  having  color  q.  If  there  is  no  such  rectangle,  return  NIL. 


1 The letter Q is mnemonic for "qonnected qomponents" and "qolor." The first letters of the alphabet
are reserved for rectangles.


RECOLOR!(q_1, q_2)

Takes all rectangles of color q_1 and color q_2 and makes them all either color q_1 or
color q_2. The other color is destroyed.

We implement the component set Q using a vector in which each color is represented
as an index in the vector. Each slot in the vector contains a pointer to the
first  rectangle  in  a  doubly  linked  list  of  all  rectangles  of  that  color,  and  the  number 
of  rectangles  in  the  list.  The  pointers  to  implement  the  linked  lists  can  be  stored  with 
the  actual  rectangles.  Each  rectangle  also  stores  the  index  of  its  color.  If  the  number 
field  is  zero,  the  color  is  unused,  and  we  then  use  the  pointer  field  to  implement  a  free 
list  of  the  unused  colors.  An  extra  variable  is  needed  to  store  the  head  of  the  free  list. 

All operations except RECOLOR! can be implemented in constant time. If we
always merge the color with the smaller number of rectangles into the one with the
larger number, then we can do O(N) recolorings in O(N lg N) time. There are at most
W rectangles in the component set Q at any given time, so the data structure need only
be of size O(W).
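
The following Python sketch illustrates the component set operations and the smaller-into-larger merging rule. It is an approximation of the structure described above, not a transcription: Python dictionaries and sets stand in for the vector and the doubly linked lists, and the class and method names are hypothetical.

```python
class ComponentSet:
    """Sketch of the component set Q; colors are small integers."""

    def __init__(self):
        self._members = {}    # color -> set of active rectangle ids
        self._color_of = {}   # rectangle id -> color
        self._free = []       # destroyed colors, available for reuse
        self._next_color = 0  # next never-used color

    def color_new(self, rect):
        """COLOR!(A): assign rect a color that no active rectangle uses."""
        if self._free:
            q = self._free.pop()
        else:
            q = self._next_color
            self._next_color += 1
        self._members[q] = {rect}
        self._color_of[rect] = q
        return q

    def uncolor(self, rect):
        """UNCOLOR!(A): drop rect; destroy its color if it was the last one."""
        q = self._color_of.pop(rect)
        self._members[q].discard(rect)
        if not self._members[q]:
            del self._members[q]
            self._free.append(q)

    def color(self, rect):
        """COLOR(A): return rect's current color."""
        return self._color_of[rect]

    def representative(self, q):
        """REPRESENTATIVE(q): any rectangle of color q, or None."""
        members = self._members.get(q)
        return next(iter(members)) if members else None

    def recolor(self, q1, q2):
        """RECOLOR!(q1, q2): merge the smaller color into the larger one."""
        if q1 == q2:
            return q1
        if len(self._members[q1]) < len(self._members[q2]):
            q1, q2 = q2, q1  # make q1 the larger class, which survives
        smaller = self._members.pop(q2)
        for rect in smaller:
            self._color_of[rect] = q1
        self._members[q1] |= smaller
        self._free.append(q2)
        return q1
```

Each time a rectangle is recolored, the class it belongs to at least doubles in size, which is the usual argument behind the O(N lg N) bound on the total recoloring work.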

1.4.  Territory  set 

To achieve an O(N lg N) worst case running time for the connected components
algorithm, we must find a way to maintain the component set Q without looking at
every intersection. Figure 3 shows the basic idea. The active rectangles B, C, and D
have the same color, say 1. The new rectangle E intersects all three of these rectangles,
which tells us that rectangle E should be given the same color as rectangle B, all
rectangles with B’s color should be merged with rectangles of rectangle C’s color, etc.
We would get the same result, however, if we just noticed that rectangle E intersects
some rectangle(s), all of color 1. That is, instead of asking, "What other rectangles
does rectangle E intersect?" we would like to be able to ask, "For what color q in the
component set Q does rectangle E intersect at least one rectangle colored q?" We now
describe a new data structure called a territory set T that will allow us to answer this
question using small space and time.

The territory set T is a refinement of the illuminator data structure used by
Guibas and Saxe in their algorithm for the connected components problem [4]. The
territory set is essentially a colored partition {t_i} of the scanline. Conceptually, each
territory has two fields: its interval and its color. The interval is a closed interval in y.
We implement the color indirectly by associating with each territory a representative
rectangle which is in the territory, and therefore has the same color as the territory.
Each territory t in T obeys the following rules:

1.  Each  active  rectangle  is  covered  by  exactly  one  territory. 

2. Each territory covers at least one active rectangle. To ensure that the territory
set is never empty, we assume there is a dummy rectangle above all rectangles in
the database that extends the full length of the design.

3.  All  active  rectangles  covered  by  territory  t  have  the  same  color  as  t. 

Figure 3: The inefficiencies that can arise from intersection-based connectivity
algorithms. Colors of active rectangles are represented as circled numbers. When
rectangle E enters we would like to know it should have the same color as each rectangle
colored 1 without recognizing this fact three different times via intersection checks.


For example, in Figure 4 no rectangles go across the boundary between territories
t_1 and t_2. Each territory covers at least one active rectangle. Each active rectangle’s
color corresponds to the color of the territory that covers it. Here, rectangles A, E and
G and territory t_1 that covers them are colored 17. Rectangles C and F and territory
t_2 are colored 42.

The  territory  set  T  supports  the  following  operations: 

T-INSERT!(t)

Add  territory  t  to  the  territory  set. 

T-DELETE!(t)

Delete  territory  t  from  the  territory  set. 

LOCATE(y) 

Returns  the  territory  that  includes  the  y-coordinate  y.  If  the  point  y  falls  on  the 
boundary  between  two  territories,  the  lower  of  the  two  is  returned. 

NEXT(t) 

Returns  the  territory  immediately  above  territory  t. 

COLOR(t)

Returns  the  color  of  territory  t.  This  operation  involves  getting  t’s  representative 
rectangle  and  getting  the  color  from  the  rectangle. 

The territory set T can be implemented as a standard height-balanced tree using
O(W) space. The operations T-INSERT!, T-DELETE!, LOCATE, and NEXT can each
be implemented in O(lg W) = O(lg N) time. As a simple optimisation, the territories
can be linked in order, which allows NEXT to run in constant time.
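
The behavior of LOCATE and NEXT, including the rule that a boundary point belongs to the lower territory, can be illustrated with a sorted Python list and binary search. This is a sketch under assumptions: a list gives O(lg W) lookups but O(W) insertions and deletions, so it does not match the bounds of the height-balanced tree; territories are identified by list position; and each territory is assumed to extend from the top of the one below it up to its own upper boundary. The names TerritoryList, locate, and next_above are hypothetical.

```python
import bisect


class TerritoryList:
    """Sorted-list sketch of the territory set T (illustration only)."""

    def __init__(self):
        self._tops = []  # upper boundary of each territory, ascending
        self._reps = []  # representative rectangle of each territory

    def insert(self, y_top, rep):
        """T-INSERT!(t): add a territory ending at y_top."""
        i = bisect.bisect_left(self._tops, y_top)
        self._tops.insert(i, y_top)
        self._reps.insert(i, rep)

    def delete(self, index):
        """T-DELETE!(t): remove the territory at the given position."""
        del self._tops[index]
        del self._reps[index]

    def locate(self, y):
        """LOCATE(y): position of the territory containing y.

        bisect_left returns the first territory whose top is >= y, so a
        point on a shared boundary is assigned to the lower territory.
        """
        i = bisect.bisect_left(self._tops, y)
        return i if i < len(self._tops) else None

    def next_above(self, index):
        """NEXT(t): position of the territory immediately above, or None."""
        return index + 1 if index + 1 < len(self._tops) else None

    def representative(self, index):
        """Representative rectangle, from which COLOR(t) would be read."""
        return self._reps[index]
```

In the test accompanying this sketch, the dummy rectangle of rule 2 is modeled by a territory whose upper boundary is infinity, so that LOCATE always finds a territory.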



Figure 4: The territory set (right) for a collection of rectangles (left) is essentially
a colored partition of the scanline. Colors of active rectangles and territories are
represented as circled numbers.

1.5. The Algorithm

The  connected  components  algorithm  operates  in  two  phases.  The  first  phase  is 
a  forward  scan  over  the  rectangles  during  which  connectivity  information  is  prepared 
that  is  written  out  to  an  intermediate  sequential  file  on  disk.  The  second  phase  is  a 
backward  scan  over  the  intermediate  file  during  which  component  labels  are  assigned 
to  each  rectangle. 

1.5.1.  The  forward  scan 

The data structures used by the forward scan contain only those rectangles that
are active, which ensures that the O(W) space bound is met, but which also leads to
problems maintaining connectivity across the entire database. When we see an end
event for a rectangle A signaling that A is to become inactive, we are not prepared
to give A a final label, yet we must purge A from our internal data structures. For
example,  at  the  time  rectangles  A  and  C  in  Figure  5  become  inactive,  there  is  no  way 
to  guess  that  they  are  in  the  same  connected  component.  Were  we  to  give  them  final 
labels  now,  we  would  incorrectly  give  them  distinct  labels. 

Since  we  cannot  give  each  rectangle  A  a  final  label  in  the  forward  scan,  we  give 
it  a  friend.  Rectangle  A’s  friend  is  another  rectangle  which  (1)  is  active  at  the  time 
rectangle  A  leaves,  and  (2)  is  known  to  be  in  the  same  connected  component  as  rectangle 
A.  If  there  is  no  such  rectangle  at  the  time  rectangle  A  leaves,  then  its  friend  is  NIL. 
Figure  5  shows  a  possible  assignment  of  friends. 


At the end of the forward pass, each connected component is linked together by a
tree of friend arrows. From this friend information, the back pass can construct final
component labels. The idea is that each friend arrow points from left to right if the
source and destination rectangles are sorted by right edge, or equivalently, by time of
exit. Thus, a component label assigned to the root of the tree will propagate right to
left through the tree during the back scan.
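
A minimal Python sketch of this propagation follows. It assumes the forward scan has produced a list of (rectangle, friend) pairs in order of exit; the function name back_scan_labels and the friend assignment used to exercise it are hypothetical. For brevity the sketch keeps every label in one dictionary, whereas the algorithm described in this chapter keeps only active rectangles in primary memory.

```python
def back_scan_labels(exits):
    """Assign final component labels from friend information.

    exits: list of (rect_id, friend_id or None), ordered by time of exit
           in the forward scan (hypothetical in-memory layout).
    Returns a dict mapping rect_id to an integer component label.
    """
    labels = {}
    next_label = 0  # counter holding the next unused label
    # Walk the exits from last to first, i.e. from right to left.
    for rect_id, friend in reversed(exits):
        if friend is None:
            # A friendless rectangle is the root of its friend tree.
            labels[rect_id] = next_label
            next_label += 1
        else:
            # The friend left later, so it has already been labeled.
            labels[rect_id] = labels[friend]
    return labels
```

A friend is always still active when a rectangle leaves, so it exits later in the forward scan and is therefore met earlier in the reversed traversal. That ordering is what makes the dictionary lookup safe.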

The  start  event 

Processing a start event for rectangle A during the forward scan involves four
steps: setting up, handling top and bottom boundary conditions, recoloring selected
rectangles, and cleaning up.

Set up. Figure 6 shows the important y-coordinates and intervals for the general
case. The bottom and top coordinates of A’s interval are designated y_bot and y_top. The
endpoints of the k territories in the territory set T that A overlaps are y_0, y_1, ..., y_k.
The k territories are gathered into a list L by first using LOCATE to find the territory
that includes y_bot, and then using NEXT to gather the remaining territories that overlap
A’s interval [y_bot, y_top]. All the territories in L are then removed from T, which leaves
a gap in T from y_0 to y_k. This gap will be repaired in subsequent steps.

Intuitively,  the  colors  of  the  territories  in  list  L  represent  our  first  guess  at  which 
colors  must  be  merged  due  to  the  entrance  of  rectangle  A.  Since  each  territory  contains 
at  least  one  active  rectangle,  the  territories  in  the  middle  of  the  list  will  necessarily 
contain  a  rectangle  that  intersects  A. 

Handle boundary conditions. Rectangle A extends only partially into the top and
bottom territories, so we must explicitly reference the scan set S to determine whether
there are active rectangles in these two territories that intersect A. We describe only
the handling of the top boundary condition since the bottom boundary condition is
symmetric. Also, for simplicity, we shall consider the special case k = 1 (Figure 7)
after we deal with the general case k ≥ 2 (Figure 6).

Handling the top boundary condition for k ≥ 2 involves determining whether
the top territory should be kept in list L. The first case is when there is some active
rectangle B that intersects the interval (y_{k-1}, y_top]. The interval of the rectangle B
falls entirely within the top territory, so it follows that A, B, and every other active
rectangle covered by the top territory must have the same color by the time we finish
processing the entrance of A. Therefore, we leave the top territory in the list L, and
nothing more needs to be done.

Otherwise, no active rectangle intersects A in the top territory, and the top
territory is removed from L. Since k is at least 2, there must be an active rectangle
in the interval [y_top, y_k] because the top territory must contain at least one rectangle,
and the interval (y_{k-1}, y_top] contains none. Therefore, we can return the top territory
to the territory set with the shortened interval [y_top, y_k] without violating any of the
properties a territory must have. In other words, chopping off empty space does not
hurt.


Figure 6: The territory set is shown on the left and the rectangles on the right
for the case when k > 1. The colors of territories t_1, ..., t_k are a first guess at the
colors to merge because of rectangle A’s entrance.


We now discuss the processing of the top boundary condition for the special case
when k = 1 (Figure 7), since once again, the bottom boundary condition is symmetric.
If rectangle A intersects some active rectangle, then we’re done. If rectangle A does
not intersect any active rectangle, it is possible that the rectangle that justified the
existence of the single territory in list L is below rectangle A, instead of above. In this
case, we must explicitly query the scan set S with the interval [y_top, y_1] to determine
whether there is an active rectangle to justify putting a territory over the interval. If
there is an active rectangle, we must enter a new territory into T with the shortened
interval [y_top, y_1] using the color of the old territory.

Recolor.  Now,  the  colors  of  the  territories  in  list  L  are  exactly  the  colors  that 
must  be  merged  because  of  rectangle  A’s  entrance.  We  first  color  rectangle  A  with 
a  new  color.  We  then  merge  A’s  color  with  the  color  of  each  territory  in  L.  The 
colors  are  automatically  garbage  collected  by  the  component  set  Q.  Because  of  our 
pointer  implementation  of  territory  colors,  no  territory  will  ever  be  colored  with  a 
garbage-collected color.

Clean up. We finish the servicing of rectangle A’s entrance by repairing the
territory set T and making A active. The gap left after handling boundary conditions
becomes the interval of a new territory with the color of rectangle A. Rectangle A is
inserted in the rectangle set R and the scan set S. Since the left side of a rectangle
indicates an end event in the back scan, enter an end event for rectangle A in the
intermediate file that will serve as input to the back scan.

The  end  event 

Servicing  an  end  event  for  rectangle  A  requires  us  first  to  find  the  associated 
rectangle  object  for  A  in  the  rectangle  set  R.  Then,  we  must  output  a  start  event 
for  A  in  the  back  pass  and  fix  up  the  internal  data  structures.  We  accomplish  this 
processing  in  three  steps:  making  rectangle  A  inactive,  associating  A  with  a  friend, 
and  fixing  the  territory  set  T. 

Make  A  inactive.  Let  q  be  rectangle  A’s  color  before  processing  the  end  event. 
Uncolor  rectangle  A,  and  remove  it  from  the  scan  set  S  and  the  rectangle  set  R. 

Find  a  friend.  Query  the  component  set  Q  for  a  representative  of  color  q.  Associate 
this  representative  rectangle  (possibly  NIL)  with  rectangle  A  so  that  A  can  now  tell  its 
friend  when  asked.  Write  out  this  information  as  a  start  event  for  rectangle  A  for  use 
in the back scan. We shall say that rectangles that receive NIL as a friend have no
friend  or  are  friendless. 

Fix  the  territory  set.  Pick  any  point  on  rectangle  A’s  interval,  and  use  LOCATE 
to  find  the  one  territory  t  that  covers  rectangle  A.  Find  a  rectangle  D  in  the  scan  set  S 
that  intersects  t’s  interval  to  see  if  there  is  some  active  rectangle  to  justify  t’s  existence. 
(Recall  that  a  territory  must  cover  at  least  one  active  rectangle.)  If  no  rectangle  exists, 
then  A  is  the  last  active  rectangle  in  t’s  interval,  and  territory  t  can  be  eliminated  by 
extending the interval of the next territory above t to include t’s interval.

If the existence of territory t is justified by some active rectangle B, and A is
serving as the representative for territory t, then make rectangle B the representative
of territory t.


The  second  phase,  the  back  scan,  passes  backwards  through  the  intermediate  file 
of  rectangle-friend  information  created  in  the  forward  scan,  producing  a  final  file  of 
rectangle-label pairs which will be sorted by left edge. During this right-to-left scan,
each  rectangle  receives  its  final  labeling  from  its  friend.  The  back  scan  uses  only  one 
data  structure,  the  rectangle  set  R.  It  also  requires  a  counter  initialized  to  0. 

During  the  back  scan,  the  rectangle  set  R  holds  all  active  rectangles,  and  each 
active rectangle knows its final component label. Labels are assigned sequentially
during  the  back  scan,  and  the  counter  holds  the  value  of  the  next  label  to  be  assigned. 

The  start  event 

The first step in servicing a start event for rectangle A is to assign a final label
to  A.  If  A  has  no  friend,  it  is  the  rightmost  rectangle  in  its  component,  and  so  a  new 
label  must  be  assigned  from  the  counter.  Store  this  label  into  A  and  increment  the 
counter. 

Otherwise,  find  rectangle  A’s  friend  in  the  rectangle  set  R,  and  give  A  the  same 
label as its friend. Rectangle A’s friend must be active since A and its friend were
simultaneously  active  in  the  forward  scan.  Rectangle  A  left  first  in  the  forward  scan  so 
it  must  enter  after  its  friend  in  the  back  scan.  Finally,  add  rectangle  A  to  the  rectangle 
set  R. 

The  end  event 

Processing  an  end  event  for  a  rectangle  A  consists  of  simply  removing  A  from 
the  rectangle  set  R  and  writing  out  rectangle  A  with  its  label  to  a  final  file.  No 
rectangle  that  subsequently  enters  has  A  as  a  friend  because  the  two  rectangles  are  not 
simultaneously  active.  Thus,  no  other  rectangle  will  need  to  get  a  label  from  A,  and 
hence, it is safe to remove A. The final file is sorted by left edge from right to left.
Reversing the file leaves it sorted left to right by left edge, as was the original input
file.
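
The back scan described above can be sketched in a few lines. This is only an illustration under stated assumptions: the intermediate file is modeled as an in-memory list of hypothetical tuples ("start", rect, friend) and ("end", rect), already in back-scan order, and the rectangle set R is modeled as a dictionary rather than a balanced tree. The function name back_scan is also an assumption.

```python
def back_scan(events):
    """Assign final component labels during the right-to-left back scan.

    `events` is a hypothetical stand-in for the intermediate file:
    ("start", rect, friend) or ("end", rect) tuples in back-scan order.
    `friend` is None for a friendless rectangle.
    """
    active = {}    # rectangle set R: active rectangle -> final label
    counter = 0    # value of the next label to be assigned
    final = []     # stand-in for the final file of (rectangle, label) pairs

    for event in events:
        if event[0] == "start":
            _, rect, friend = event
            if friend is None:
                # Friendless: draw a new label from the counter.
                active[rect] = counter
                counter += 1
            else:
                # The friend entered earlier in the back scan, so it is active.
                active[rect] = active[friend]
        else:
            _, rect = event
            # No later rectangle can name `rect` as a friend; safe to remove.
            final.append((rect, active.pop(rect)))

    final.reverse()    # restore left-to-right order by left edge
    return final
```

For example, if A and B intersect and C is isolated, the forward scan might emit events that, read backwards, are C start (friendless), B start (friendless), C end, A start (friend B), B end, A end. The sketch then gives A and B one label and C another.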

1.6.  Proof  of  correctness 

This  section  shows  that  two  rectangles  get  the  same  label  if  and  only  if  they  are 
in  the  same  connected  component. 

(⇒) We first show that if two rectangles are given the same label, then they are in the
same  connected  component.  We  prove  this  by  induction  on  the  number  of  rectangles 
given  the  same  label.  Suppose  rectangle  A  is  the  first  rectangle  given  label  l.  Then 
at  the  time  we  process  rectangle  A’s  start  event  during  the  back  scan,  rectangle  A  is 
friendless  and  the  value  of  the  counter  is  l.  If  rectangle  A  had  a  friend,  it  would  be 
given  the  same  label  as  its  friend  contradicting  our  assumption  that  rectangle  A  was 
the  first  to  receive  its  label.  The  counter  is  incremented  after  rectangle  A  is  given  the 
label l, so no friendless rectangle to enter after A will get the label l. By the same
argument,  no  friendless  rectangle  to  enter  before  rectangle  A  could  have  been  given 
the  label  l. 

Assume that at some point in the back scan, k rectangles have been given label
l and all k are in the same connected component. Some j ≤ k of these rectangles
are active (i.e., are in the rectangle set R). Now the start event for some rectangle B
causes B to get label l. For this to happen, rectangle B must have a friend rectangle
C which is one of the j active rectangles with label l. Since rectangle C is rectangle
B’s friend, both rectangles B and C must have had the same color in the component
set Q at the time rectangle B left in the forward scan.

To  finish  the  argument,  we  must  show  that  two  rectangles  simultaneously  having 
the  same  color  in  the  component  set  Q  must  be  in  the  same  connected  component. 
If this is true, then rectangles B and C are in the same connected component and
therefore by transitivity rectangle B is in the same connected component as the other
k  rectangles  given  label  l. 

We  show  that  two  rectangles  sharing  a  color  in  the  component  set  Q  must  be  in  the 
same  connected  component  by  induction  on  the  number  of  rectangles  with  that  color. 
A  new  color  is  introduced  into  the  component  set  Q  only  when  a  COLOR!  operation 
is performed upon a rectangle A during its start event. Hence each color begins with
only  one  member  rectangle.  Other  rectangles  join  a  color  only  through  the  MERGE! 
operations  performed  during  the  processing  of  the  start  event  for  a  rectangle. 

Assume that before processing the start event for a rectangle A, all rectangles with
the same color in the component set Q are in the same connected component. After
handling the boundary conditions, there are m > 0 territories in the list L. These
territories have n ≤ m distinct colors q_1, q_2, ..., q_n which are all merged into one final
color q. The colors q_1, q_2, ..., q_n are exactly those colors for which at least one member
rectangle intersects rectangle A. Each pair of rectangles in the final color q is connected
by a path of intersecting rectangles. If they shared a color q_i in the component set Q
before rectangle A entered, then by assumption there is a path connecting them that
includes only rectangles originally colored q_i. Otherwise, there is a path between the
two rectangles that includes rectangle A. Therefore all rectangles now colored q are in
the same connected component.

(⇐) We now prove that if two rectangles are in the same connected component, then
they  get  the  same  label.  It  suffices  to  show  that  if  two  rectangles  intersect  they  get  the 
same  label  because  then  all  rectangles  in  the  same  connected  component  get  the  same 
label  by  transitivity.  The  proof  has  two  parts.  First,  we  argue  that  if  two  rectangles 
intersect,  then  during  the  forward  scan  they  have  the  same  color  in  the  component  set 
Q  while  they  are  both  active.  Then  we  show  that  if  two  rectangles  are  simultaneously 
active  and  have  the  same  color  in  the  component  set  Q  then  they  get  the  same  label. 

If two rectangles A and B intersect, they have the same color in the component set
Q while they are both active. Assume without loss of generality that rectangle B enters
after rectangle A. Let t be the territory in the territory set T that covers rectangle A
at the time rectangle B enters. Since rectangles A and B intersect, territory t must at
least partially cover rectangle B so it will be gathered into the list L in the first step of
the processing of the start event for rectangle B. The presence of the active rectangle A
intersecting rectangle B guarantees that after the boundary condition checks, territory
t will still be in the list L. Therefore after the merging, rectangles A and B will be the
same  color  in  the  component  set  Q.  From  that  point  on  they  will  always  move  together 
in  any  recolorings,  so  they  will  always  have  the  same  color  until  one  of  them  leaves. 


If two rectangles have the same color in the component set Q while they are
active,  then  they  will  get  the  same  final  label.  Suppose  a  rectangle  A  is  about  to  leave. 
Consider  the  set  of  rectangles  that  have  the  same  color  as  rectangle  A.  Rectangle  A 
chooses  one  as  a  friend  (shown  by  an  arrow  in  figure  8).  Later,  other  rectangles  may 
join  this  set  through  merges.  As  each  rectangle  in  the  set  leaves,  it  chooses  a  friend 
from  among  those  left  in  the  set.  Eventually  an  exiting  rectangle  finds  itself  alone,  and 
it  exits  without  a  friend. 

Figure  8  shows  one  such  set  of  rectangles  taken  from  the  example  in  figure  5. 
During  the  forward  scan  each  of  these  rectangles  simultaneously  shares  a  color  in  the 
component  set  Q  with  at  least  one  other  rectangle  in  the  set.  For  example  rectangles 
A  and  B  share  a  component  immediately  after  rectangle  B  enters  and  rectangles  B,  E, 
and  F  share  a  component  immediately  after  rectangle  F  enters. 

We  can  view  the  illustration  in  figure  8  as  an  acyclic  graph  with  the  rectangles  as 
vertices  and  the  friend  relation  arrows  as  directed  edges.  Each  vertex  has  outdegree 
one  except  for  a  single  sink,  the  friendless  rectangle  H.  If  we  start  at  any  vertex  in  the 
graph  and  follow  the  edges,  we  always  end  up  at  the  sink.  We  know  from  our  previous 
argument  that  a  rectangle  gets  the  same  label  as  its  friend.  That  friend  in  turn  gets 
the  same  label  as  its  friend,  . . .  (down  the  friend  links) . . .,  who  gets  the  same  label  as 
the  sink  H.  By  transitivity  any  two  rectangles  that  are  in  the  same  component  of  the 
component  set  Q  while  active  will  get  the  same  final  label. 
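
The friend-link argument can be illustrated with a short sketch. The friend relation below is a hypothetical example, not the one drawn in figure 8, and the names sink_of and friend are assumptions. The dictionary maps each rectangle to the friend it chose on exit, with None for the friendless sink.

```python
def sink_of(rect, friend):
    """Follow friend links from `rect` until the friendless sink is reached."""
    while friend[rect] is not None:
        rect = friend[rect]
    return rect


# Hypothetical friend relation: every rectangle names a friend that exits later.
friend = {"A": "B", "B": "E", "E": "H", "F": "H", "H": None}

# Every rectangle in the set reaches the same sink, hence the same label.
sinks = {rect: sink_of(rect, friend) for rect in friend}
```

Because each rectangle takes its label from its friend, all rectangles that reach the same sink end up with the sink's label.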


[Figure 8 axis: increasing time of exit on forward scan]


Figure 8: An example of a component taken from figure 5. The arrows represent
the friend relation. Each rectangle shares a component color with at least one other
rectangle in this set during the forward scan. During the back scan, these rectangles
enter from right to left. Rectangle H receives a new label, and all the other rectangles
receive their labels indirectly from rectangle H.


1.7.  Analysis 

This section shows that the worst-case running time of the connected components
algorithm is O(N lg N), the amount of primary memory required is O(W), and the
number of transfers between primary and secondary memory is O(N). We have already
seen that each data structure requires only O(W) primary memory space, and it can
be verified that the number of disk transfers is O(N). Thus, we must demonstrate that
the running time of the algorithm is O(N lg N).

The rectangle set R and the scan set S each contribute only O(N lg N) to the
overall time. The rectangle set R is used in both the forward scan and the back scan.
It contributes only O(N lg N) to the time in each phase since it performs at most two
operations, each requiring O(lg N) time, on each of the O(N) start and end events.
The scan set S performs one insertion or deletion and at most four S-FIND operations
for each start or end event.

Operations on the territory set T contribute O(N lg N) time as well. During the
servicing of an end event, the territory set T performs at most one LOCATE, two
T-DELETE!’s, and one T-INSERT!, if we regard the modification of a territory interval
as a deletion followed by an insertion. For a start event, only one LOCATE and at
most three T-INSERT!’s are performed. The number of calls to NEXT, T-DELETE!,
and COLOR directly depends on the size of the list L, however.

We shall show that each operation is performed at most O(N) times. The operations
that are performed a constant number of times for a given event are executed
O(N) times overall. The other operations are called once for each time a territory
appears in a list L during a start event. Thus, showing that the sum total of the sizes
of L throughout the entire forward pass is O(N) will produce our desired bound. The
total number of insertions into T is at most 4N, which therefore bounds the total
number of deletions. Moreover, each of these territories can participate in a list L only
once since it is deleted from T at that time and replaced by the consolidated territory
or a new boundary territory. Hence, the sum total of the lengths of L is O(N), which
also bounds the number of times any operation is performed. Since each operation
costs O(lg N) time, the total work performed on the territory set is O(N lg N).
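
One way to write out the counting argument is the following chain of inequalities. The symbol L_e (the list L gathered at start event e) and the constant c are notation introduced here for illustration. The bound 3N + N uses the per-event counts stated above: at most three T-INSERT!’s for each of the N start events and one for each of the N end events.

```latex
\[
\sum_{\text{start events } e} |L_e|
\;\le\; \#\{\text{deletions from } T\}
\;\le\; \#\{\text{insertions into } T\}
\;\le\; 3N + N \;=\; 4N ,
\]
\[
\text{total time spent on } T \;\le\; c \,\lg N \cdot O(N) \;=\; O(N \lg N).
\]
```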

It remains to analyze the component set Q. Each start event causes one COLOR!
operation, and each end event causes one UNCOLOR! and REPRESENTATIVE operation.
Using the same arguments as above for the territory set, at most O(N) RECOLOR!
and COLOR operations are performed throughout the whole forward scan. Thus, its
contribution to the overall running time of the connected components algorithm is also
O(N lg N).

1.8.  Remarks 

This section presents the important extension of the connected components algorithm
to multiple layers. We also discuss some alternative implementations of the data
structures which may be better suited to a practical implementation.

The  connected  components  problem  of  rectangles  in  the  plane  presented  in  this 
paper  is  a  simplification  of  the  problem  faced  in  computer-aided  design  of  VLSI. 


Computing  the  electrically  equivalent  rectangles  in  multiple  planar  layers  of  a  VLSI 
design  is  not  much  more  difficult  than  the  one-layer  problem  discussed  in  this  paper, 
even  though  contact  cuts  can  allow  components  to  snake  up  and  down  among  layers. 

To  find  the  connected  components  of  rectangles  on  multiple  layers,  we  simply  run 
a  copy  of  the  basic,  one-layer,  connected  components  algorithm  on  each  layer.  In 
the  forward  scan  each  layer  is  given  its  own  scan  set,  rectangle  set,  and  territory  set. 
The  component  set,  however,  is  global  to  the  entire  computation.  Each  contact  is 
represented  explicitly  on  the  layers  it  intersects.  In  the  back  scan,  both  the  counter 
and  the  rectangle  set  are  global.  No  further  changes  are  necessary. 

Some of the data structures necessary for the connected components algorithm can
be implemented more practically than with the asymptotically efficient height-balanced
trees presented in the body of the paper. The rectangle set R, for example, can be
implemented by hashing on the rectangle identification number, which would lead to
good average case behavior. At the cost of a bit more complication, the component set
Q can be implemented with a union-find structure that allows O(N) merges in almost
linear time [11].
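
As a rough illustration of the union-find idea, here is a minimal sketch using union by size and path halving. It covers only the merging of colors; the UNCOLOR! and REPRESENTATIVE operations of the component set Q would need additional bookkeeping that is not shown. The class and method names are assumptions made for this example.

```python
class UnionFind:
    """Disjoint sets with union by size and path halving."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def add(self, x):
        """Introduce x as a new singleton set (a fresh color)."""
        self.parent[x] = x
        self.size[x] = 1

    def find(self, x):
        """Return the root of x's set, halving the path along the way."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets containing a and b; return the surviving root."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra          # hang the smaller tree under the larger
        self.size[ra] += self.size[rb]
        return ra
```

Two rectangles then share a color exactly when find returns the same root for both.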

The scan set S and the territory set T can be implemented by using bins, as has
been done for other VLSI algorithms [2]. Each bin represents a fixed portion of the
scanline and contains a pointer to the list of objects that overlap that bin. A desirable
bin size can be chosen based on statistical information about the VLSI design. The
worst-case running time of the algorithm may be degraded, however, because long,
tall rectangles will be split across many bins. The difference between this approach
and an intersection-based approach, such as [10], may be negligible.
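
A binned scan set might look like the sketch below. It assumes integer coordinates in a known range [lo, hi), closed segments stored as hypothetical (low, high, id) tuples, and a fixed bin size; none of these details come from the cited work. Note how a tall segment is recorded in every bin it overlaps, which is the source of the worst-case cost mentioned above.

```python
class BinnedScanSet:
    """Scan set in which each bin lists the segments overlapping it."""

    def __init__(self, lo, hi, bin_size):
        self.lo = lo
        self.bin_size = bin_size
        n_bins = -(-(hi - lo) // bin_size)      # ceiling division
        self.bins = [[] for _ in range(n_bins)]

    def _bin_range(self, a, b):
        """Indices of the bins touched by the closed interval [a, b]."""
        first = max(int((a - self.lo) // self.bin_size), 0)
        last = min(int((b - self.lo) // self.bin_size), len(self.bins) - 1)
        return range(first, last + 1)

    def insert(self, seg):
        for i in self._bin_range(seg[0], seg[1]):
            self.bins[i].append(seg)

    def delete(self, seg):
        for i in self._bin_range(seg[0], seg[1]):
            self.bins[i].remove(seg)

    def find(self, a, b):
        """Return some stored segment overlapping [a, b], or None."""
        for i in self._bin_range(a, b):
            for seg in self.bins[i]:
                if seg[0] <= b and a <= seg[1]:
                    return seg
        return None
```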


2.  Interval  Trees

The  scan  set  data  structure  is  central  to  any  algorithm  that  uses  scanning.  In 
particular,  an  algorithm  which  passes  a  scanline  over  a  set  of  rectangles  requires  a 
data  structure  to  manipulate  the  line  segments  that  the  scanline  induces  in  that  set 
of  rectangles.  The  INSERT!,  DELETE!,  and  FIND  operations  that  the  algorithm 
of  chapter  1  requires  are  a  subset  of  the  operations  required  by  other  scan-based 
algorithms  for  problems  involving  rectangles,  for  example  [3,  4]. 

Other data structures have been developed to handle INSERT!, DELETE!, and
the more complicated operation of enumerating all segments intersecting a given test
interval. Each of these data structures has its shortcomings. The segment trees of
Bentley and Wood [3] require O(n lg n) space for n segments. McCreight’s priority
search trees [6] require only O(n) space, but they are quite complicated. Priority search
trees are built upon height-balanced trees. Unfortunately, updates after rotations
require O(lg n) time, so the underlying balanced tree structure is limited to those which
have a constant number of rotations on each insertion/deletion.

The  three  operations  we  want — insert  a  segment,  delete  a  segment,  and  find  any 
segment  that  overlaps  a  test  interval — do  not  require  the  heavy  artillery  of  priority 
search  trees.  A  simple  modification  applied  to  any  height  balanced  tree  scheme  will 
suffice. The new scheme, interval trees, requires only O(n) space and it performs each
of the three operations in O(lg n) time.

Section  1  introduces  interval  trees.  Section  2  describes  the  insert  and  delete 
operations.  Section  3  gives  the  algorithm  for  the  find  operation  and  argues  that  it 
is  correct.  Section  4  offers  some  conclusions. 

2.1.  The  structure  of  Interval  Trees 

Interval trees represent a set of line segments, all intervals along the same line.
Each segment, s, is represented as an ordered pair (s1, s2), s1 < s2, where s1 is the
minimum point of the line segment and s2 is the maximum. It is assumed that the
minimum points of all segments are distinct. If they are not, break ties with the
maximum points or with an ID attached to each segment. All that really matters is
that the intervals are distinguishable.

To  implement  an  interval  tree,  start  with  any  balanced  search  tree  scheme:  AVL 
trees,  2-3  trees,  etc.  Set  up  the  search  tree  as  it  would  normally  be  set  up  using  the 
minimum  point  of  each  segment  as  the  search  key.  The  segments  may  be  stored  in 
internal  nodes  as  in  AVL  trees  or  stored  strictly  at  the  leaves  as  in  2-3  trees.  Now  add 
to  each  internal  node  a  range  interval  corresponding  to  the  minimum  and  maximum 
points  covered  by  any  segment  in  the  subtree  rooted  at  that  node.  The  minimum  point 
of  the  range  interval  for  an  internal  node  always  comes  from  its  leftmost  son  if  the 
tree  is  based  on  any  standard  search  tree.  However,  the  maximum  point  comes  from 
anywhere  in  the  subtree. 
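
A minimal sketch of such a node follows, assuming a binary tree that stores one segment per node (the AVL-like variant) and closed segments given as (min, max) pairs. Balancing information is omitted, and the names Node, lo, hi, and update are illustrative assumptions.

```python
class Node:
    """Search-tree node keyed on seg[0], augmented with a range interval."""

    def __init__(self, seg, left=None, right=None):
        self.seg = seg        # (minimum point, maximum point)
        self.left = left
        self.right = right
        self.update()

    def update(self):
        """Recompute the range interval (lo, hi) from this node's segment
        and its children's range intervals; constant time per node."""
        lo, hi = self.seg
        for child in (self.left, self.right):
            if child is not None:
                lo = min(lo, child.lo)
                hi = max(hi, child.hi)
        self.lo, self.hi = lo, hi


# The maximum of the root's range interval may come from anywhere in the
# subtree: here it comes from the left child, which holds segment (1, 9).
root = Node((5, 6), left=Node((1, 9)), right=Node((7, 8)))
```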

Figure 9 shows one possible implementation of an interval tree for a set of segments.
It uses a 2-3 tree strategy where all the segments are stored in the leaves.
Figure  10  shows  another  underlying  scheme  more  like  an  AVL  tree  where  segments  are 


2.2.  Insertion  and  Deletion  of  Segments 

This  section  describes  how  to  insert  and  delete  segments  from  an  interval  tree.  A 
segment can be inserted/deleted and the range intervals updated in time O(lg n) in an
interval  tree  with  n  segments.  We  describe  the  operations  in  some  detail  for  interval 
trees  implemented  with  underlying  tree  structures  similar  to  2-3  trees  and  AVL  trees. 

Insertion  and  deletion  of  a  segment  is  done  in  accordance  with  the  underlying 
balanced tree scheme. In AVL trees and others like them, balance is maintained through
rotations.  In  B  tree  schemes  such  as  2-3  trees,  it  is  maintained  by  node  splitting, 
sharing  keys  among  siblings,  etc.  In  either  case,  maintenance  of  the  range  intervals  is 
quite  easy.  This  is  because  once  the  range  intervals  of  a  node’s  children  are  established, 
calculation  of  its  range  interval  takes  a  constant  amount  of  time. 

Insertion into 2-3 trees involves two stages. The first stage is a search to find the
leaf  in  which  the  new  segment  belongs.  The  segment  is  then  inserted  into  that  leaf. 
In  the  second  stage,  this  leaf  splits  into  two  leaves  if  the  addition  of  the  new  segment 
made  the  leaf  too  full.  The  splitting  of  the  leaf  may  make  its  parent  overfull.  If  so, 
the  parent  splits,  and  so  on. 

To update range intervals during the first stage, fix affected intervals on the way
down  the  tree  during  the  search  for  the  appropriate  leaf.  That  is,  each  node  passed 
through  in  the  search  will  be  an  ancestor  for  the  new  segment,  so  if  the  new  segment 
has  a  minimum  point  lower  than  the  minimum  point  of  a  range  interval  or  a  maximum 
greater  than  its  maximum,  adjust  the  range  interval.  On  the  second  stage,  adjust  the 
interval  of  each  node  involved  in  splitting.  The  splitting  goes  from  the  bottom  of  the 
tree  up,  so  each  node’s  children  are  stable  by  the  time  it  splits.  Therefore  calculation 
of  a  new  range  interval  requires  a  constant  amount  of  time  for  those  nodes. 
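
The first stage, widening range intervals on the way down, can be sketched as follows. For brevity this uses a plain, unbalanced binary search tree with segments in every node rather than a 2-3 tree, so the second (splitting) stage is not shown. All names are assumptions for illustration.

```python
class Node:
    def __init__(self, seg):
        self.seg = seg              # (minimum point, maximum point)
        self.left = None
        self.right = None
        self.lo, self.hi = seg      # range interval of this subtree


def insert(root, seg):
    """Insert seg keyed on its minimum point, widening the range interval
    of every node passed on the way down. No rebalancing is performed."""
    if root is None:
        return Node(seg)
    node = root
    while True:
        # This node will be an ancestor of the new segment.
        node.lo = min(node.lo, seg[0])
        node.hi = max(node.hi, seg[1])
        if seg[0] < node.seg[0]:
            if node.left is None:
                node.left = Node(seg)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(seg)
                return root
            node = node.right
```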


Figure 11: An example of a typical rotation. Circles represent nodes of a tree.
Triangles represent trees of 0 or more nodes. This rotation is performed if both of
the trees T2 and T3 have height h and tree T1 has height h + 1. McCreight uses this
example when explaining updates to priority search trees after rotations. Priority
search trees require O(lg n) time for each rotation, which restricts the number of
balanced tree schemes suitable as underlying structures. Interval trees require only
a constant amount of time per rotation.


Deletion is much the same except that instead of splitting, there is coalescing. The
same  reasoning  applies  here:  adjust  range  intervals  on  the  way  down  while  searching 
for  the  segment  to  delete,  and  adjust  each  node  involved  in  coalescing. 

Trees  that  use  rotations  to  adjust  balance  are  even  easier.  As  with  those  interval 
trees  with  underlying  2-3  trees,  adjust  range  intervals  on  the  way  down  the  tree  when 
making an insertion/deletion. If the tree is unbalanced, make the necessary rotations.
Figure  11  shows  a  typical  rotation.  The  circles  represent  nodes  with  segments  stored 
in  them.  The  triangles  represent  subtrees  of  0  or  more  such  nodes.  These  rotations 
cause problems with priority search trees because some of the subtrees must
change when updating information stored in the nodes n1 and n2. With normal search
trees and interval trees this is not the case. The range intervals in trees T1, T2, and T3
remain unchanged. Give node n2 the old range interval from node n1, and calculate
a new range interval for node n1 using segment A and the range intervals of its two
subtrees after the rotation. Since rotations take constant time, we can do up to O(lg n)
of them on each insertion/deletion without affecting the asymptotic time requirements
for these operations.
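
The constant-time update after a rotation can be sketched as below. This assumes a binary tree with one segment per node and shows a single right rotation; the names n1 and n2 are used for the old and new subtree roots by assumption, and the mirror-image left rotation is analogous.

```python
class Node:
    def __init__(self, seg, left=None, right=None):
        self.seg, self.left, self.right = seg, left, right
        self.update()

    def update(self):
        """Recompute the range interval from the segment and the children."""
        lo, hi = self.seg
        for child in (self.left, self.right):
            if child is not None:
                lo, hi = min(lo, child.lo), max(hi, child.hi)
        self.lo, self.hi = lo, hi


def rotate_right(n1):
    """Rotate n1's left child n2 above n1 and repair both range intervals."""
    n2 = n1.left
    old_range = (n1.lo, n1.hi)
    n1.left = n2.right
    n2.right = n1
    n1.update()                  # n1's new children are untouched subtrees
    n2.lo, n2.hi = old_range     # n2 now spans exactly what n1 used to span
    return n2
```

Only the two rotated nodes are touched; the range intervals stored inside the three hanging subtrees are left as they were.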

2.3.  The  FIND  operation 

This  section  gives  the  algorithm  for  the  FIND  operation  required  by  the  scan  set  of 
the  algorithm  presented  in  chapter  1.  It  also  contains  an  argument  that  this  algorithm 
is  correct. 

Given a test interval t = (t1, t2) and an interval tree, FIND(t1, t2) returns a segment
in the interval tree which covers (overlaps) the interval t or it returns nil if there is
no such segment. By appropriately defining “covers” we can implement open or closed
endpoints in any combination for both the segments and the test interval (e.g., open test
interval with half-open segments or closed test interval with open segments). We can
even vary the endpoint types among the segments in the interval tree if we want to make
the processing a bit more complicated and allow two extra bits of information per segment.
The difference between open and closed endpoints is merely the difference between a test
with a strict “less than” and a test with “less than or equal to”.
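
The remark about strict and non-strict comparisons can be made concrete with a small sketch. The helper make_covers is a hypothetical name, and for simplicity it treats the segment and the test interval as both closed or both open.

```python
import operator


def make_covers(closed=True):
    """Build a 'covers' predicate; the only difference between closed and
    open endpoints is whether the comparison is <= or <."""
    less = operator.le if closed else operator.lt

    def covers(seg, t):
        return less(seg[0], t[1]) and less(t[0], seg[1])

    return covers


covers_closed = make_covers(closed=True)
covers_open = make_covers(closed=False)
```

With closed endpoints the segments (1, 2) and (2, 3) touch at the point 2 and count as overlapping; with open endpoints they do not.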

Given  the  root  of  an  interval  tree,  r,  and  a  test  interval  t,  to  FIND  a  covering 
segment  (if  any)  proceed  as  follows:  If  there  are  any  segments  stored  in  node  r  and  one 
of  them  covers  t,  return  that  segment  and  halt.  Otherwise,  look  at  the  range  interval 
of  node  r’s  leftmost  child.  If  this  range  interval  covers  test  interval  t,  then  recursively 
FIND  a  covering  segment  in  that  child.  Otherwise,  check  the  range  interval  of  node 
r’s next leftmost child, and recurse to that node if its range interval covers the test
interval  t,  and  so  on.  If  node  r  contains  no  covering  segment  and  none  of  its  children 
have  range  intervals  that  cover  test  interval  t,  then  return  nil. 

To  see  why  this  algorithm  works,  let  us  look  at  a  binary  tree.  The  argument  for 
n-ary  trees  is  similar.  If  we  find  a  covering  segment  in  the  node,  then  we  succeed.  If 
there  is  no  covering  segment  in  the  node  and  none  of  the  children  have  range  intervals 
touching  the  test  interval,  then  there  is  no  hope;  the  tree  does  not  contain  a  covering 
segment,  so  we  should  return  nil.  The  only  subtle  point  is  that  if  the  range  interval 
of the left child covers the test interval but a recursive search in the left child fails to
yield a covering segment, then a search of the right child must fail as well. This allows
us  to  follow  exactly  one  path  through  the  interval  tree  during  a  FIND  operation  so  the 
operation  costs  O(lgn)  time  in  the  worst  case  in  an  interval  tree  with  n  segments. 
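
The FIND procedure for the binary case can be sketched as follows, assuming closed intervals and a tree that stores one segment per node, keyed on the minimum point. Balancing is omitted and the names are illustrative. The loop follows a single root-to-leaf path, as argued above.

```python
class Node:
    def __init__(self, seg, left=None, right=None):
        self.seg, self.left, self.right = seg, left, right
        lo, hi = seg
        for child in (left, right):
            if child is not None:
                lo, hi = min(lo, child.lo), max(hi, child.hi)
        self.lo, self.hi = lo, hi        # range interval of the subtree


def overlaps(a, b):
    """Closed-interval overlap test."""
    return a[0] <= b[1] and b[0] <= a[1]


def find(node, t):
    """Return a segment overlapping test interval t, or None."""
    while node is not None:
        if overlaps(node.seg, t):
            return node.seg
        left, right = node.left, node.right
        if left is not None and overlaps((left.lo, left.hi), t):
            node = left      # if this fails, the right child would fail too
        elif right is not None and overlaps((right.lo, right.hi), t):
            node = right
        else:
            return None
    return None
```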

Figure 12 shows why the right son will be of no help if the left son’s range interval
covers the test interval but the recursive search fails. Suppose segments a and b are
in the left subtree of a node. During the FIND operation we see the range interval
for the left subtree covers test interval t, but the search ultimately fails. For this to
happen, interval t must fall into a gap of the left subtree range interval. The only
way a segment in the right subtree could cover interval t would be if there were some
segment c that reached into the gap. However, that would mean c1 < b1, which violates
the structure of the search tree. If c1 < b1 then segment c must be in the left subtree
if segment b is.

An interesting corollary to this argument is that if the range interval (l1, l2) of the
left son overlaps the range interval (r1, r2) of the right son, then there is some segment
in the left subtree that extends through the entire overlap region. The overlap region
is either (r1, l2) or (r1, r2) if r2 < l2. The segment (p, l2) which registers the maximum
point in the range interval for the left subtree must have a minimum point p < r1 so
this segment covers the entire overlap region. Consequently, if during a FIND operation
two children of a node have range intervals overlapping the test interval, then we know
there is a covering segment and that it is in the leftmost child. If we are not interested
in a specific segment, but only care to know if there exists a covering segment or not,
we could stop at this point with an affirmative answer. For example, this is all that is
necessary in the Guibas and Saxe connected components algorithm.


Figure 12: If a and b are segments in a subtree rooted at some node n, then the
subtree rooted at node n’s right sibling cannot contain a segment c reaching into the
gap between segments a and b. That would violate the strict ordering of the search
tree based on minimum segment endpoints.


2.4.  Remarks 


This  section  wraps  up  the  discussion  of  interval  trees  by  applying  them  to  the 
scan  set  used  in  the  connected  components  algorithm  of  chapter  1  and  by  looking  at 
the problem of enumerating all segments covered by a test interval.

The  scan  set  of  chapter  1  is  supposed  to  contain  rectangles,  not  segments.  The 
problem  is  easily  remedied  by  using  pointers  to  rectangles  whose  y  interval  will  serve 
as  the  segments.  Our  insistence  on  distinguishability  is  motivated  by  the  scan  set. 
Removing  either  one  of  two  identical  segments  from  an  interval  tree  would  not  appear 
to  matter.  However,  removing  the  wrong  rectangle  from  the  scan  set  could  cause  many 
problems  with  the  connected  components  algorithm. 

The  interval  tree  is  sparser  than  Bentley  and  Wood's  segment  trees  and  simpler 
than  McCreight’s  priority  search  trees.  Any  existing  code  for  a  balanced  search  tree 
can  be  modified  to  implement  interval  trees  in  a  very  short  time.  Interval  trees  have 
only  the  linear  ordering  on  the  lower  end  of  the  interval  and  the  very  easily  maintained 
range  intervals.  They  do  not  maintain  any  heap-type  information  on  maximum  points 
which  is  what  causes  all  the  headaches  with  priority  search  trees. 

We  are  taking  advantage  of  the  special  case  operation  FIND.  In  fact,  interval 
trees  will  not  perform  as  well  as  segment  trees  or  priority  search  trees  for  general 
enumeration. All efforts to modify interval trees to perform enumeration have led
back to variations on priority search trees.


Directions for Further Research

As mentioned in the introduction, the Szymanski-Van Wyk algorithm for connected
components  assumes  the  same  machine  model  as  the  algorithm  presented  in  this  thesis. 
Their  algorithm,  however,  is  more  general.  It  is  designed  for  connectivity  analysis, 
computation  of  union,  intersection,  etc.  of  general  polygons.  Their  algorithm  uses  a 
two-pass  scan  assuming  an  edge  file  on  disk.  It  runs  in  0(/VW)  time  where  N  is  the 
number  of  edges  in  the  file,  and  uses  O(W)  space  where  W  is  the  maximum  number  of 
edges  to  cut  any  vertical  scanline. 

Nievergelt and Preparata also have an algorithm for such geometric operations on
arbitrary polygons [7]. Their algorithm is designed to run entirely within primary
memory, but its asymptotic running time is better than that of the Szymanski-Van
Wyk algorithm. Given n points in the plane and some connecting line segments with s
segment intersections, the Nievergelt-Preparata algorithm runs in time O((n + s) lg n).
In the special case where the points and lines form convex polygons, it runs in
O(n lg n + s) time. Could a data structure such as the territory set of section 1.4 be
introduced into the Szymanski-Van Wyk algorithm to allow it to run in worst case time
closer to that of the Nievergelt-Preparata algorithm?

The  geometric  problems  discussed  in  this  thesis  are  only  a  small  number  of  those 
problems  for  which  a  two-layer  model  may  prove  fruitful.  Many  geometric  problems  are 
incorporated  into  applications  programs  where  the  size  of  the  input  is  arbitrarily  large 
in  practice.  For  example,  the  convex  hull  is  used  in  simulating  chemical  reactions  and 
in  estimating  population  parameters  in  statistics  and  triangulation  is  used  in  numerical 
analysis  and  in  computer  aided  design  of  VLSI  circuits.  There  are  algorithms  for  these 
problems  in  the  literature  with  good  or  even  optimal  worst  case  time  bounds,  but 
these  algorithms  assume  all  data  is  available  in  primary  memory  at  all  times.  Are 
there algorithms for these problems based upon the two-layer memory model which
match the time bounds of existing algorithms but run in small primary memory and
guarantee good paging?

Finally, is there a data structure with the simplicity of interval trees that will
allow enumeration of all k segments that overlap a test interval in time O(lg n + k)
where  n  is  the  number  of  segments  in  the  structure?  If  not,  is  there  a  structure  similar 
to  priority  search  trees  that  does  not  have  the  same  level  of  complexity  in  insertions 
and  deletions? 